AD-A227  803 




UMIACS-TR -90-40  March  1990 

CS-TR-2437 


Recursive  Star-Tree  Parallel  Data-Structure 

Omer Berkman and Uzi Vishkin†
Institute for Advanced Computer Studies and
†Department of Electrical Engineering
University  of  Maryland 
College  Park,  MD  20742 
and 

Tel  Aviv  University 


COMPUTER  SCIENCE 
TECHNICAL  REPORT  SERIES 




UNIVERSITY  OF  MARYLAND 

COLLEGE  PARK,  MARYLAND 
20742 


Approved for public release; distribution unlimited.


ABSTRACT 


This paper introduces a novel parallel data-structure, called recursive STAR-tree (denoted '*-tree'). For its definition, we use a generalization of the * functional¹. Using recursion in the spirit of the inverse-Ackermann function, we derive recursive *-trees.

The  recursive  *-tree  data-structure  leads  to  a  new  design  paradigm  for 
parallel  algorithms.  This  paradigm  allows  for: 

• Extremely fast parallel computations. Specifically, $O(\alpha(n))$ time (where $\alpha(n)$ is the inverse of Ackermann's function) using an optimal number of processors on the (weakest) CRCW PRAM.

•  These  computations  need  only  constant  time,  using  an  optimal  number  of 
processors  if  the  following  non-standard  assumption  about  the  model  of 
parallel  computation  is  added  to  the  CRCW  PRAM:  an  extremely  small 
number  of  processors  can  write  simultaneously  each  into  different  bits  of  the 
same  word. 

Applications  include: 


(1) A new algorithm for finding lowest common ancestors in trees which is considerably simpler than the known algorithms for the problem.

(2) Restricted domain merging.

(3) Parentheses matching.

(4) A new parallel reducibility.

¹ Given a real function $f$, denote $f^{(1)}(n) = f(n)$ and $f^{(i)}(n) = f(f^{(i-1)}(n))$ for $i > 1$. The * functional maps $f$ into another function $*f$: $*f(n) = \min\{i \mid f^{(i)}(n) \le 1\}$. If this minimum does not exist then $*f(n) = \infty$.


1.  Introduction 

The model of parallel computation that is used in this paper is the concurrent-read concurrent-write (CRCW) parallel random access machine (PRAM). We assume that several processors may attempt to write at the same memory location only if they are seeking to write the same value (the so-called Common CRCW PRAM). We use the weakest Common CRCW PRAM model, in which only concurrent writes of the value one are allowed. Given two parallel algorithms for the same problem, one is more efficient than the other if: (1) primarily, its time-processor product is smaller, and (2) secondarily (but importantly), its parallel time is smaller. Optimal parallel algorithms are those with a linear time-processor product. A fully-parallel algorithm is a parallel algorithm that runs in constant time using an optimal number of processors. An almost fully-parallel algorithm is a parallel algorithm that runs in $\alpha(n)$ (the inverse of Ackermann's function) time using an optimal number of processors.

The notion of a fully-parallel algorithm represents an ultimate theoretical goal for designers of parallel algorithms. Research on lower bounds for parallel computation (see references later) indicates that for nearly any interesting problem this goal is unachievable. These same results also preclude almost fully-parallel algorithms for the same problems. Therefore, any result that approaches this goal is somewhat surprising.

The class of doubly logarithmic optimal parallel algorithms and the challenge of designing such algorithms are discussed in [BBGSV-89]. The class of almost fully-parallel algorithms represents an even stricter demand.


There is a remarkably small number of problems for which there exist optimal parallel algorithms that run in $o(\log\log n)$ time. These problems include: (a) OR and AND of $n$ bits. (b) Finding the minimum among $n$ elements, where the input consists of integers in the domain $[1,\ldots,n^c]$ for a constant $c$. See Fich, Ragde and Wigderson [FRW-84]. (c) $\log^{(k)} n$-coloring of a cycle, where $\log^{(k)}$ is the $k$'th iterate of the log function and $k$ is constant [CV-86a]. (d) Some probabilistic computational geometry problems [S-88]. (e) Matching a pattern string in a text string, following a processing stage in which a table based on the pattern is built [Vi-89].


Not only is the number of such upper bounds small, there is evidence that for almost any interesting problem an $o(\log\log n)$ time optimal upper bound is impossible. We mention time lower bounds for a few very simple problems. Clearly, these lower bounds apply to more involved problems. For brevity, only lower bounds for optimal speed-up algorithms are stated. (a) Parity of $n$ bits. The time lower bound is $\Omega(\log n/\log\log n)$. This follows from the lower bound of [H-86] for circuits together with the general simulation result of [StV-84], or from Beame and Hastad [BHa-87]. (b) Finding the minimum among $n$ elements. Lower bound: $\Omega(\log\log n)$ on the comparison model [Va-75]. The same lower bound holds for (c) merging two sorted arrays of numbers [BHo-85].

The main contribution of this paper is a parallel data-structure, called recursive *-tree. This data-structure also provides a new paradigm for parallel algorithms. There are two known examples where tree-based data-structures provide a "skeleton" for parallel algorithms:

(1)  Balanced  binary  trees.  The  depth  of  a  balanced  binary  tree  with  n  leaves  is  log n. 

(2) "Doubly logarithmic" balanced trees. The depth of such a tree with $n$ leaves is $\log\log n$. Each node of a doubly logarithmic tree, whose rooted subtree has $x$ leaves, has $\sqrt{x}$ children.

Balanced binary trees are used in the prefix-sums algorithm of [LF-80] (which is perhaps the most heavily used routine in parallel computation) and in many other logarithmic time algorithms. [BBGSV-89] show how to apply doubly logarithmic trees for guiding the flow of the computation in several doubly logarithmic algorithms (including some previously known algorithms). Similarly, the recursive *-tree data-structure provides a new pattern for almost fully-parallel algorithms.

In  order  to  be  able  to  list  results  obtained  by  application  of  *-trees  we  define  the  following 
family  of  extremely  slow  growing  functions.  Our  definition  is  direct.  A  subsequent  comment 
explains  how  this  definition  leads  to  an  alternative  definition  of  the  inverse-Ackermann  function. 
For a more standard definition see [Ta-75]. We remark that such a direct definition is implicit in several "inverse Ackermann related" serial algorithms (e.g., [HS-86]).

The  Inverse-Ackermann  function 

Consider a real function $f$. Let $f^{(i)}$ denote the $i$-th iterate of $f$. (Formally, we denote $f^{(1)}(n) = f(n)$ and $f^{(i)}(n) = f(f^{(i-1)}(n))$ for $i > 1$.) Next, we define the * (pronounced "star") functional that maps the function $f$ into another function $*f$:
$*f(n) = \min\{i \mid f^{(i)}(n) \le 1\}$. (Notational comment. Note that the function $\log^*$ will be denoted $*\log$ using our notation. This change is for notational convenience.)

We define inductively a series $I_k$ of slow growing functions:

(i) $I_0(n) = n-2$, and (ii) $I_k = *I_{k-1}$.

The first four in this series are familiar functions: $I_0(n) = n-2$, $I_1(n) = \lfloor n/2 \rfloor$, $I_2(n) = \lfloor \log n \rfloor$ and $I_3(n) = \lfloor *\log n \rfloor$.

The "inverse-Ackermann" function is $\alpha(n) = \min\{i \mid I_i(n) \le i\}$. See the following comment.

Comment. Ackermann's function is defined as follows:

$A(0,0) = 0$; $A(i,0) = 1$, for $i > 0$; $A(0,j) = j+2$, for $j > 0$; and
$A(i,j) = A(i-1, A(i,j-1))$, for $i,j > 0$.

It is interesting to note that $I_k$ is actually the inverse of the $k$'th recursion level of $A$, the Ackermann function. Namely: $I_k(n) = \min\{i \mid A(k,i) \ge n\}$ or $I_k(A(k,n)) = n$. The definition of $\alpha(n)$ is equivalent to the more often used (but perhaps less intuitive) definition: $\min\{i \mid A(i,i) \ge n\}$.
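
The definitions above can be tried out directly. The following Python sketch is only an illustration under stated assumptions: the names `star`, `I` and `alpha` are ours, inputs are positive integers, and $\alpha(n)$ is taken as the least $i$ with $I_i(n) \le i$, which is the reading that agrees with the equivalent form $\min\{i \mid A(i,i) \ge n\}$.

```python
def star(f):
    """The * functional: (*f)(n) = min{ i >= 1 : f applied i times to n is <= 1 }."""
    def star_f(n):
        i, x = 1, f(n)
        while x > 1:
            x = f(x)
            i += 1
        return i
    return star_f


def I(k):
    """Return the function I_k, where I_0(n) = n - 2 and I_k = *I_{k-1}."""
    f = lambda n: n - 2
    for _ in range(k):
        f = star(f)
    return f


def alpha(n):
    """Inverse-Ackermann value, assumed here to be min{ i : I_i(n) <= i }."""
    i = 0
    while I(i)(n) > i:
        i += 1
    return i


print(I(1)(10))     # 5, i.e. floor(10 / 2)
print(I(2)(1024))   # 10, i.e. floor(log 1024)
print(I(3)(65536))  # 4, i.e. *log 65536
print(alpha(65536)) # 4
```

The iteration is deliberately naive (each $I_k$ is evaluated by repeated application of $I_{k-1}$), so it is meant for small experiments rather than efficient evaluation.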

Applications of recursive *-trees

1. The lowest-common-ancestor (LCA) problem. Suppose a rooted tree $T$ is given for preprocessing. The preprocessing should enable a single processor to process quickly queries of the following form. Given two vertices $u$ and $v$, find their lowest common ancestor in $T$.

Results. (i) Preprocessing in $I_m(n)$ time using an optimal number of processors. Queries will be processed in $O(m)$ time, that is, $O(1)$ time for constant $m$. A more specific result is: (ii) almost fully-parallel preprocessing and $O(\alpha(n))$ time for processing a query. These results assume that the Euler tour of the tree and the level of each vertex in the tree are given. Without this assumption the time for preprocessing is $O(\log n)$, using an optimal number of processors, and each query can be processed in constant time. For a serial implementation the preprocessing time is linear and a query can be processed in constant time.

Significance: Our algorithm for the LCA problem is new and is based on a completely different approach than the serial algorithm of Harel and Tarjan [HT-84] and the simplified and parallelizable algorithm of Schieber and Vishkin [ScV-88]. Its serial version is considerably simpler than these two algorithms. Specifically, consider the Euler tour of the tree and replace each vertex in the tour by its level. This gives a sequence of integers. Unlike previous approaches, the new LCA algorithm is based only on analysis of this sequence of integers. This provides another interesting example where the quest for parallel algorithms also enriches the field of serial algorithms. Algorithms for quite a few problems use an LCA algorithm as a subroutine. We mention some: (1) Strong orientation [Vi-85]; (2) Computing open ear-decomposition and st-numbering of a biconnected graph [MSV-86]; also [FRT-89] and [RR-89] use as a subroutine an algorithm for st-numbering and thus also the LCA algorithm. (3) Approximate string matching on strings [LV-88] and on trees [SZ-89], and retrieving information on strings from their suffix trees [AILSV-88].

2. The all nearest zero bit problem. Let $A = (a_1, a_2, \ldots, a_n)$ be an array of bits. Find for each bit $a_i$ the nearest zero bit both to its left and right.

Result:  An  almost  fully-parallel  algorithm. 

Literature.  A  similar  problem  was  considered  by  [CFL-83]  where  the  motivation  was  circuits. 
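
To pin down the input and output of the all nearest zero bit problem, here is a plain serial Python sketch. It is not the almost fully-parallel algorithm of this paper; it only states the specification in executable form. The function name `all_nearest_zero`, the 0-based indexing, the use of `None` when no zero exists on a side, and the reading of "left" and "right" as strictly left and strictly right are all our assumptions.

```python
def all_nearest_zero(bits):
    """For each position, return the index of the nearest zero strictly to its
    left and strictly to its right (None if there is no such zero)."""
    n = len(bits)
    left, right = [None] * n, [None] * n

    last_zero = None
    for i in range(n):                 # left-to-right sweep
        left[i] = last_zero
        if bits[i] == 0:
            last_zero = i

    last_zero = None
    for i in range(n - 1, -1, -1):     # right-to-left sweep
        right[i] = last_zero
        if bits[i] == 0:
            last_zero = i

    return left, right


left, right = all_nearest_zero([1, 0, 1, 1, 0, 1])
print(left)   # [None, None, 1, 1, 1, 4]
print(right)  # [1, 4, 4, 4, None, None]
```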

3.  The  parentheses  matching  problem.  Suppose  we  are  given  a  legal  sequence  of 
parentheses.  Find  for  each  parenthesis  its  mate. 

Result. Assuming the level of nesting of each parenthesis is given, we have an almost fully-parallel algorithm. Without this assumption $T = O(\log n/\log\log n)$, using an optimal number of processors.

Literature. Parentheses matching in parallel was considered by [AMW-88], [BSV-88], [BV-85] and [DS-83].

Remark. The algorithm for parentheses matching is deferred to a later paper ([BeV-90a]), in order to keep this paper within reasonable length.

4. Restricted domain merging. Let $A = (a_1,\ldots,a_n)$ and $B = (b_1,\ldots,b_n)$ be two non-decreasing lists, whose elements are integers drawn from the domain [1,. The problem is to merge them into a sorted list.

Result:  An  almost  fully-parallel  algorithm. 

Literature. Merging in parallel was considered by [Van-89], [BHo-85], [Kr-83], [SV-81] and [Va-75].

Remark.  The  merging  algorithm  is  presented  in  another  paper  ([BeV-90]).  The  length  of  this 
paper  was  again  a  concern.  Also,  we  recently  found  a  way  to  implement  it  on  a  less  powerful 
model  (CREW  PRAM)  with  the  same  bounds,  and  somewhat  relax  the  restricted-domain  limitation 
using  unrelated  techniques. 

5.  Almost  fully-parallel  reducibility.  Let  A  and  B  be  two  problems.  Suppose  that  any  input 
of  size  n  for  problem  A  can  be  mapped  into  an  input  of  size  O  (n)  for  problem  B .  Such  mapping 
from  A  to  B  is  an  almost  fully-parallel  reducibility  if  it  can  be  realized  by  an  almost  fully-parallel 
algorithm. 

Given a convex polygon, the all nearest neighbors (ANN) problem is to find for each vertex of the polygon its nearest (Euclidean) neighbor. Using almost fully-parallel reducibilities we prove the following lower bound for the ANN problem: Any CRCW PRAM algorithm for the ANN problem that uses $O(n \log^c n)$ (for any constant $c$) processors needs $\Omega(\log\log n)$ time.

We  note  that  this  lower  bound  was  proved  in  [ScV-88a]  using  a  considerably  more  involved 
technique. 

Fully-parallel  results 

For our fully-parallel results we introduce the CRCW-bit PRAM model of computation. In addition to the above definition of the CRCW PRAM we assume that a few processors can write simultaneously each into different bits of the same word. Specifically, in our algorithms this number of processors is very small and never exceeds $O(I_d(n))$, where $d$ is a constant. Therefore, the assumption looks to us quite reasonable from the architectural point of view. We believe that the cost for implementing a step of a PRAM on a feasible machine is likely to absorb implementation of this assumption at no extra cost. However, we do not suggest considering the CRCW-bit PRAM as a theoretical substitute for the CRCW PRAM.

Specific  fully-parallel  results 

1. The lowest common ancestor problem. The preprocessing algorithm is fully-parallel, assuming that the Euler tour of the tree and the level of each vertex in the tree are given. A query can be processed in constant time.

2. The all nearest zero bit problem. The algorithm is fully-parallel.

3. The parentheses matching problem. Assuming the level of nesting of each parenthesis is given, the algorithm is fully-parallel.

4. Restricted domain merging. The algorithm is fully-parallel. Results 3 and 4 can be derived from [BeV-90a] and [BeV-90] respectively, similarly to the fully-parallel algorithms here.

We elaborate on where our fully-parallel algorithms use the new CRCW-bit assumption. The algorithms work by mapping the input of size $n$ into $n$ bits. Then, given any constant $d$, we derive a value $x = O(I_d(n))$. The algorithms proceed by forming $n/x$ groups of $x$ bits each. Informally, our problem is then to pack all $x$ bits of the same group into a single word and solve the original problem with respect to an input of size $x$ in constant time. This packing is exactly where our almost fully-parallel CRCW PRAM algorithms fail to become fully-parallel and the CRCW-bit assumption is used. We believe that it is of theoretical interest to figure out ways of avoiding such packing and thereby get fully-parallel algorithms on a CRCW PRAM without the CRCW-bit assumption, and we suggest it as an open problem.

A repeating motif in the present paper is putting restrictions on the domain of problems. Perhaps our more interesting applications concern problems whose input domain is not explicitly restricted. However, as part of the design of our algorithms for these respective problems, we identified a few subproblems whose input domains come restricted.

The rest of this paper is organized as follows. Section 2 describes the recursive *-tree data-structure and Section 3 recalls a few basic problems and algorithms. In Section 4 the algorithms for LCA and all nearest zero bit are presented. The almost fully-parallel reducibility is presented in Section 5 and the last section discusses how to compute efficiently functions that are used in this manuscript.

2.  The  recursive  *-tree  data-structure 

Let $n$ be a positive integer. We define inductively a series of $\alpha(n)-1$ trees. For each $2 \le m \le \alpha(n)$ a balanced tree with $n$ leaves, denoted $BT(m)$ (for Balanced Tree), is defined. For a given $m$, $BT(m)$ is a recursive tree in the sense that each of its nodes holds a tree of the form $BT(m-1)$.

The base of the inductive definition (see Figure 2.1). We start with the definition of the *-tree $BT(2)$. $BT(2)$ is simply a complete binary tree with $n$ leaves.

The inductive step (see Figure 2.2). For $m$, $3 \le m \le \alpha(n)$, we define $BT(m)$ as follows. $BT(m)$ has $n$ leaves. The number of levels in $BT(m)$ is $*I_{m-1}(n)+1$ $(= I_m(n)+1)$. The root is at level 1 and the leaves are at level $*I_{m-1}(n)+1$. Consider a node $v$ at level $1 \le l \le *I_{m-1}(n)$ of the tree. Node $v$ has $I_{m-1}^{(l-1)}(n)/I_{m-1}^{(l)}(n)$ children (we define $I_{m-1}^{(0)}(n)$ to be $n$). The total number of leaves in the subtree rooted at node $v$ is $I_{m-1}^{(l-1)}(n)$. We refer to the part of the $BT(m)$ tree described so far as the top recursion level of $BT(m)$ (denoted for brevity $TRL$-$BT(m)$). In addition, node $v$ contains recursively a $BT(m-1)$ tree. The number of leaves in this tree is exactly the number of children of node $v$ in $BT(m)$.
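
As an illustration of the top recursion level, the following Python sketch lists, level by level, the number of children of a node and the number of leaves below it for $BT(3)$, where $I_2(n) = \lfloor \log n \rfloor$. The names `floor_log2` and `trl_profile` are ours, and the sketch ignores rounding: it assumes $n$ is chosen (for example $n = 65536$) so that every division comes out exact.

```python
def floor_log2(n):
    """I_2(n) = floor(log2 n) for a positive integer n."""
    return n.bit_length() - 1


def trl_profile(n, f=floor_log2):
    """Return (level, children per node, leaves per subtree) for the internal
    levels of the top recursion level, where f plays the role of I_{m-1}."""
    sizes = [n]                      # sizes[l - 1] = f applied (l - 1) times to n
    while sizes[-1] > 1:
        sizes.append(f(sizes[-1]))
    profile = []
    for level in range(1, len(sizes)):
        leaves = sizes[level - 1]
        children = leaves // sizes[level]   # assumed exact; rounding is ignored
        profile.append((level, children, leaves))
    return profile


for row in trl_profile(65536):
    print(row)
# (1, 4096, 65536)
# (2, 4, 16)
# (3, 2, 4)
# (4, 2, 2)
```

For $n = 65536$ the tree has four internal levels plus the leaf level, in agreement with $*I_2(n)+1 = 5$ levels.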

In a nutshell, there is one key idea that enabled our algorithms to be as fast as they are. When the $m$'th tree $BT(m)$ is employed to guide the computation, we invest $O(1)$ time on the top recursion level of $BT(m)$! Since $BT(m)$ has $m$ levels of recursion, this leads to a total of $O(m)$ time.

Similar  computational  structures  appeared  in  a  few  contexts.  See,  [AS-87]  and  [Y-82]  for 
generalized  range-minima  computations,  [HS-86]  for  Davenport-Schinzel  sequences  and  [CFL-83] 
for  circuits. 

3.  Basics 

We  will  need  the  following  problems  and  algorithms. 

The  Euler  tour  technique 

Consider a tree $T = (V,E)$, rooted at some vertex $r$. The Euler tour technique makes it possible to solve several problems on trees in logarithmic time and optimal speed-up (see also [TV-85] and [Vi-85]). The technique is summarized below.

Step 1: For each edge $(v \to u)$ in $T$ we add its anti-parallel edge $(u \to v)$. Let $H$ denote the new graph.

Since the in-degree and out-degree of each vertex in $H$ are the same, $H$ has an Euler path that starts and ends in the root $r$ of $T$. Step 2 computes this path into a vector of pointers $D$, where for each edge $e$ of $H$, $D(e)$ will have the successor edge of $e$ in the Euler path.

Step 2: For each vertex $v$ of $H$, we do the following:

(Let the outgoing edges of $v$ be $(v \to u_0), \ldots, (v \to u_{d-1})$.)

$D(u_i \to v) := (v \to u_{(i+1) \bmod d})$, for $i = 0, \ldots, d-1$. Now $D$ has an Euler circuit. The "correction" $D(u_{d-1} \to r) :=$ end-of-list (where the out-degree of $r$ is $d$) gives an Euler path which starts and ends in $r$.

Step 3: In this step, we apply list ranking to the Euler path. This will result in ranking the edges so that the tour can be stored in an array. Similarly, we can find for each vertex in the tree its distance from the root. This distance is called the level of the vertex. Such applications of list ranking appear in [TV-85]. This list ranking can be performed in logarithmic time using an optimal number of processors, by [AM-88], [CV-86] or [CV-89].

Comments:

1. In Section 4 we assume that the Euler tour is given in an already ranked form. There we systematically replace each edge $(u,v)$ in the Euler tour by the vertex $v$. We then add the root of the tree to the beginning of this new array. Suppose a vertex $v$ has $l$ children. Then $v$ appears $l+1$ times in our array.

2. We note that while advancing from a vertex to its successor in the Euler tour, the level may either increase by one or decrease by one.
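
The three steps and the vertex-array form of Comment 1 can be sketched serially in Python as follows. This is an illustration only: the function name `euler_tour` and the input format (a dictionary mapping every vertex to the ordered list of its children) are our assumptions, and the parallel list ranking of Step 3 is replaced by simply walking the successor pointers $D$.

```python
def euler_tour(children, root):
    """Return (tour, level): the Euler tour as an array of vertices, with the
    root prepended, and the level (distance from the root) of every vertex."""
    parent = {root: None}
    for v, kids in children.items():
        for u in kids:
            parent[u] = v

    # Step 1: outgoing edges of each vertex in H (children first, parent last).
    out = {v: list(kids) + ([] if parent[v] is None else [parent[v]])
           for v, kids in children.items()}

    # Step 2: successor pointers D(u_i -> v) := (v -> u_{(i+1) mod d}).
    D = {}
    for v, nbrs in out.items():
        d = len(nbrs)
        for i, u in enumerate(nbrs):
            D[(u, v)] = (v, nbrs[(i + 1) % d])

    tour, level = [root], {root: 0}
    if not out[root]:                      # single-vertex tree
        return tour, level
    D[(out[root][-1], root)] = None        # the "correction": end-of-list

    # Step 3 (serial stand-in for list ranking): follow the pointers.
    e = (root, out[root][0])
    while e is not None:
        u, v = e
        if parent[v] == u:                 # moving down one level
            level[v] = level[u] + 1
        tour.append(v)                     # replace edge (u, v) by vertex v
        e = D[e]
    return tour, level


tree = {"r": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}
tour, level = euler_tour(tree, "r")
print("".join(tour))  # racadarbr
```

In the example, vertex `a` has two children and appears three times, and successive levels along the tour differ by exactly one, as the two comments state.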

Finding  the  minimum  for  restricted-domain  inputs 

Input: Array $A = (a_1, a_2, \ldots, a_n)$ of numbers. The restricted-domain assumption: each $a_i$ is an integer between 1 and $n$.

Finding  the  minimum.  Find  the  minimum  value  in  A  . 

Fich, Ragde and Wigderson [FRW-84] gave the following parallel algorithm for the restricted-domain minimum finding problem. It runs in $O(1)$ time using $n$ processors. We use an auxiliary vector $B$ of size $n$, that is all zero initially. Processor $i$, $1 \le i \le n$, writes one into location $B(a_i)$. The problem now is to find the leftmost one in $B$. Partition $B$ into $\sqrt{n}$ equal size subarrays. For each such subarray find in $O(1)$ time, using $\sqrt{n}$ processors, if it contains a one. Apply the $O(1)$ time algorithm of Shiloach and Vishkin [SV-81] for finding the leftmost subarray of size $\sqrt{n}$ containing a one, using $n$ processors. Finally, reapply this latter algorithm for finding the index of the leftmost one in this subarray.

(Remark 3.1. This algorithm can be readily generalized to yield $O(1)$ time for inputs between 1 and $p^c$, where $c > 1$ is a constant, as long as $p \ge n$ processors are used.)
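
The data flow of the restricted-domain minimum algorithm can be followed in a serial Python simulation. This is a sketch under assumptions: the name `restricted_domain_min` is ours, each loop or list scan stands in for a constant-time parallel step (in particular, the two leftmost-one searches stand in for the [SV-81] routine), and blocks have size about $\sqrt{n}$ without requiring $n$ to be a perfect square.

```python
import math


def restricted_domain_min(a):
    """Minimum of a, assuming every entry is an integer between 1 and len(a)."""
    n = len(a)
    B = [0] * (n + 1)                  # B[1..n]; index 0 is unused
    for x in a:                        # "processor i writes one into B(a_i)"
        B[x] = 1

    s = max(1, math.isqrt(n))          # block size, about sqrt(n)
    blocks = [B[lo:lo + s] for lo in range(1, n + 1, s)]
    has_one = [any(block) for block in blocks]

    j = has_one.index(True)            # leftmost block containing a one
    return 1 + j * s + blocks[j].index(1)   # leftmost one inside that block


print(restricted_domain_min([5, 3, 7, 3, 6, 7, 4]))  # 3
```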

Range-minima  problem 

Given an array $A = (a_1, a_2, \ldots, a_n)$ of $n$ real numbers, preprocess the array so that for any interval $[a_i, a_{i+1}, \ldots, a_j]$, the minimum over the interval can be retrieved in constant time using a single processor.

We show how to preprocess $A$ in constant time using $n^{1+\epsilon}$ processors and $O(n^{1+\epsilon})$ space, for any constant $\epsilon$.

The preprocessing algorithm uses the following naive parallel algorithm for the range-minima problem. Allocate $n^2$ processors to each interval $[a_i, a_{i+1}, \ldots, a_j]$ and find the minimum over the interval in constant time as in [SV-81]. This naive algorithm runs in constant time and uses $n^4$ processors and $n^2$ space.

The preprocessing algorithm. Suppose some constant $\epsilon$ is given.

The output of the preprocessing algorithm can be described by means of a balanced tree $T$ with $n$ leaves, where each internal node has $n^{\epsilon/3}$ children. The root of the tree is at level one and the leaves are at level $3/\epsilon+1$. Let $v$ be an internal node and $u_1, \ldots, u_{n^{\epsilon/3}}$ be its children. Each internal node $v$ will have the following data.

1. $M(v)$ - the minimum over its leaves (i.e. the leaves of its subtree).

2. For each $1 \le i < j \le n^{\epsilon/3}$ the minimum over the range $\{M(u_i), M(u_{i+1}), \ldots, M(u_j)\}$.
This will require $O(n^{2\epsilon/3})$ space per internal node and a total of $O(n^{1+\epsilon/3})$ space.

The preprocessing algorithm advances through the levels of the tree in $3/\epsilon$ steps, starting from level $3/\epsilon$. At each level, each node computes its data using the naive range-minima algorithm.

Each internal node uses $n^{4\epsilon/3}$ processors and $O(n^{2\epsilon/3})$ space and the whole algorithm uses $n^{1+\epsilon}$ processors, $O(n^{1+\epsilon/3})$ space and $O(3/\epsilon)$ time.

Retrieval. Suppose we wish to find the minimum, $MIN(i,j)$, over an interval $[a_i, a_{i+1}, \ldots, a_j]$.

Let $LCA(a_i,a_j)$ be the lowest common ancestor of leaf $a_i$ and leaf $a_j$ in the tree $T$. There are two possibilities.

(i) $LCA(a_i,a_j)$ is at level $3/\epsilon$. The data at $LCA(a_i,a_j)$ gives $MIN(i,j)$.

(ii) $LCA(a_i,a_j)$ is at level $< 3/\epsilon$. Let $x$ be the number of edges in the path from $a_i$ (or $a_j$) to $LCA(a_i,a_j)$ (i.e. $LCA(a_i,a_j)$ is at level $3/\epsilon+1-x$). Using the tree $T$ we can represent interval $[i,j]$ as a disjoint union of $2x-1$ intervals whose minima were previously computed. Formally, let $r(i)$ denote the rightmost leaf of the parent of $a_i$ in $T$ and $l(j)$ denote the leftmost leaf of the parent of $a_j$ in $T$. $MIN[i,j]$ is the minimum among three numbers.

1. $MIN[i,r(i)]$. The data at the parent of $a_i$ gives this information.

2. $MIN[l(j),j]$. The data at the parent of $a_j$ gives this information.

3. $MIN[r(i)+1,l(j)-1]$. Advance to level $3/\epsilon$ of the tree to get this information recursively.

Complexity. Retrieval of $MIN(i,j)$ takes constant time: the first and second numbers can be looked up in $O(1)$ time. Retrieval of the third number takes $O(3/\epsilon-1)$ time.

The range-minima problem was considered by [AS-87], [GBT-84] and [Y-82]. In the parallel context the problem was considered in [BBGSV-89].
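
The tree of tables and the retrieval rule can be sketched serially in Python. This is a hedged illustration, not the parallel algorithm itself: the names `preprocess` and `range_min` are ours, the branching factor `b` plays the role of $n^{\epsilon/3}$, the array length is assumed to be a power of `b` (with `b >= 2` and at least `b` elements), indices are 0-based, and levels are counted from the leaves upward rather than from the root.

```python
def preprocess(a, b):
    """tables[h][t][i][j] = minimum over children i..j of node t, h levels
    above the leaves. Assumes len(a) is a power of b, with b >= 2."""
    tables = []
    M = list(a)                            # minima of the nodes one level below
    while len(M) > 1:
        level = []
        for t in range(0, len(M), b):
            kids = M[t:t + b]
            tab = [[None] * b for _ in range(b)]
            for i in range(b):             # naive all-intervals computation
                cur = kids[i]
                for j in range(i, b):
                    cur = min(cur, kids[j])
                    tab[i][j] = cur
            level.append(tab)
        tables.append(level)
        M = [tab[0][b - 1] for tab in level]
    return tables


def range_min(tables, b, i, j, h=0):
    """Minimum over positions i..j (inclusive) among the nodes at height h."""
    pi, pj = i // b, j // b                # parents of the two endpoints
    level = tables[h]
    if pi == pj:                           # case (i): common parent
        return level[pi][i % b][j % b]
    # case (ii): suffix at parent of i, prefix at parent of j, middle above
    best = min(level[pi][i % b][b - 1], level[pj][0][j % b])
    if pi + 1 <= pj - 1:
        best = min(best, range_min(tables, b, pi + 1, pj - 1, h + 1))
    return best


a = [5, 2, 7, 1, 9, 3, 8, 6, 4]
tables = preprocess(a, 3)
print(range_min(tables, 3, 1, 7))  # 1
print(range_min(tables, 3, 6, 8))  # 4
```

The recursion depth is bounded by the height of the tree, which mirrors the $O(3/\epsilon)$ retrieval bound when `b` is $n^{\epsilon/3}$.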

4.  LCA  Algorithm 

The input to this problem is a rooted tree $T = (V,E)$. Denote $n = 2|V|-1$. We assume that we are given a sequence of $n$ vertices $A = [a_1, \ldots, a_n]$, which is the Euler tour of our input tree, and that we know for each vertex $v$ its level, $LEVEL(v)$, in the tree.

Recall the range-minima problem defined in Section 3. Below we give a simple reduction from the LCA problem to a restricted-domain range-minima problem, which is an instance of the range-minima problem where the difference between each two successive numbers for the range-minima problem is exactly one. The reduction takes $O(1)$ time and uses $n$ processors. An algorithm for the restricted-domain range-minima problem is given later, implying an algorithm for the LCA problem.

4.1.  Reducing  the  LCA  problem  into  a  restricted-domain  range-minima  problem 

Let $v$ be a vertex in $T$. Denote by $l(v)$ the index of the leftmost appearance of $v$ in $A$ and by $r(v)$ the index of its rightmost appearance. For each vertex $v$ in $T$, it is easy to find $l(v)$ and $r(v)$ in $O(1)$ time and $n$ processors using the following (trivial) observation:

$l(v)$ is where $a_{l(v)} = v$ and $LEVEL(a_{l(v)-1}) = LEVEL(v)-1$.
$r(v)$ is where $a_{r(v)} = v$ and $LEVEL(a_{r(v)+1}) = LEVEL(v)-1$.

The  claims  and  corollaries  below  provide  guidelines  for  the  reduction. 

Claim 1: Vertex $u$ is an ancestor of vertex $v$ iff $l(u) < l(v) < r(u)$.

Corollary 1: Given two vertices $u$ and $v$, a single processor can find in constant time whether $u$ is an ancestor of $v$.

Corollary 2: Vertices $u$ and $v$ are unrelated (namely, neither $u$ is an ancestor of $v$ nor $v$ is an ancestor of $u$) iff either $r(u) < l(v)$ or $r(v) < l(u)$.

Claim 2. Let $u$ and $v$ be two unrelated vertices. (By Corollary 2, we may assume without loss of generality that $r(u) < l(v)$.) Then, the LCA of $u$ and $v$ is the vertex whose level is minimal over the interval $[r(u), l(v)]$ in $A$.

The reduction. Let $LEVEL(A) = [LEVEL(a_1), LEVEL(a_2), \ldots, LEVEL(a_n)]$. Claim 2 shows that after performing the range-minima preprocessing algorithm with respect to $LEVEL(A)$, a query of the form $LCA(u,v)$ becomes a range minimum query. Observe that the difference between the level of each pair of successive vertices in the Euler tour (and thus each pair of successive entries in $LEVEL(A)$) is exactly one and therefore the reduction is into the restricted-domain range-minima problem as required.
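
The reduction can be exercised with a short serial Python sketch. It assumes, as the text does, that the Euler tour `A` and the levels are given. The names `leftmost_rightmost`, `is_ancestor` and `lca` are ours, indices are 0-based, and the range minimum is found by a plain linear scan that stands in for the preprocessed range-minima structure; the sketch therefore shows the reduction, not the fast query.

```python
def leftmost_rightmost(A):
    """l(v) and r(v): first and last index of each vertex in the tour A."""
    l, r = {}, {}
    for idx, v in enumerate(A):
        l.setdefault(v, idx)
        r[v] = idx
    return l, r


def is_ancestor(u, v, l, r):
    """Claim 1 (proper ancestor): l(u) < l(v) < r(u)."""
    return l[u] < l[v] < r[u]


def lca(u, v, A, level, l, r):
    if u == v or is_ancestor(u, v, l, r):
        return u
    if is_ancestor(v, u, l, r):
        return v
    if r[u] > l[v]:                  # Corollary 2: arrange r(u) < l(v)
        u, v = v, u
    # Claim 2: vertex of minimal level over the interval [r(u), l(v)] of A.
    k = min(range(r[u], l[v] + 1), key=lambda t: level[A[t]])
    return A[k]


A = list("racadarbr")                # Euler tour of r -> (a -> (c, d), b)
level = {"r": 0, "a": 1, "b": 1, "c": 2, "d": 2}
l, r = leftmost_rightmost(A)
print(lca("c", "d", A, level, l, r))  # a
print(lca("b", "d", A, level, l, r))  # r
```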

Remark. [GBT-84] observed that the problem of preprocessing an array so that each range-minimum query can be answered in constant time (this is the range-minima problem defined in the previous section) is equivalent to the LCA problem. They gave a linear time algorithm for the former problem using an algorithm for the latter. This does not look very helpful: we know how to solve the range-minima problem based on the LCA problem, and conversely, we know how to solve the LCA problem based on the range-minima problem. Nevertheless, using the restricted-domain properties of our range-minima problem we show that this cyclic relationship between the two problems can be broken, thereby leading to a new algorithm.




4.2.  The  restricted-domain  range-minima  algorithm 

We define below a restricted-domain range-minima problem which is slightly more general than the problem for $LEVEL(A)$. The more general definition enables recursion in the algorithm below. The rest of this section shows how to solve this problem.

The  restricted-domain  range-minima  problem 

Input: Integer $k$ and array $A = (a_1, a_2, \ldots, a_n)$ of integers, such that $|a_i - a_{i+1}| \le k$. In words, the difference between each $a_i$, $1 \le i < n$, and its successor $a_{i+1}$ is at most $k$. The parameter $k$ need not be a constant.

The range-minima problem. Preprocess the array $A = (a_1, a_2, \ldots, a_n)$ so that any query MIN[i,j], $1 \le i < j \le n$, requesting the minimal element over the interval $[a_i, \ldots, a_j]$, can be processed quickly using a single processor.

Comment 1: We make the simplifying assumption that $\sqrt{k}$ is always an integer.

Comment 2: In case the minimum in the interval is not unique, find the minimal element in $[a_i, \ldots, a_j]$ whose index is smallest ("the leftmost minimal element"). Throughout this section, whenever we refer to a minimum over an interval, we will always mean the leftmost minimal element. Finding the leftmost minimal element (and not just the minimum value) will serve us later.
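
As a point of reference for the definitions above, the following is a minimal sequential sketch (not the parallel algorithm of this section). The function names and the 0-based indexing are our own choices for illustration.

```python
def is_restricted_domain(a, k):
    """True if every pair of successive elements differs by at most k."""
    return all(abs(a[i] - a[i + 1]) <= k for i in range(len(a) - 1))


def leftmost_min_index(a, i, j):
    """Index of the leftmost minimal element of a[i..j] (0-based, inclusive)."""
    best = i
    for t in range(i + 1, j + 1):
        if a[t] < a[best]:  # strict '<' keeps the leftmost of equal minima
            best = t
    return best
```

A query answered this way takes time linear in the interval length; the preprocessing described below is what makes constant-time (or $O(m)$-time) retrieval possible.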

We start by constructing inductively a series of $\alpha(n) - 1$ parallel preprocessing algorithms for our range-minima problem:

Lemma 4.2.1. The algorithm for $2 \le m \le \alpha(n)$ runs in $cm$ time, for some constant $c$, using $nI_m(n) + \sqrt{k}\,n$ processors. The preprocessing algorithm results in a table. Using this table, any range-minimum query can be processed in $cm$ time. In addition, the preprocessing algorithm finds explicitly all prefix-minima and suffix-minima, and therefore there is no need to do any processing for prefix-minima or suffix-minima queries.

Our  optimal  algorithms,  whose  efficiencies  are  given  in  Theorem  4.3.1,  are  derived  from  this 
series  of  algorithms. 

We  describe  the  series  of  preprocessing  algorithms.  We  give  first  the  base  of  the  inductive 
construction  and  later  the  inductive  step. 

The base of the inductive construction (the algorithm for m = 2)

In  order  to  provide  intuition  for  the  description  of  the  preprocessing  algorithm  for  m=2  we  present 
first  its  output  and  how  the  output  can  be  used  for  processing  a  range-minimum  query. 

Output of the preprocessing algorithm for m = 2:

(1) For each consecutive subarray $a_{j\log^3 n+1}, \ldots, a_{(j+1)\log^3 n}$, $0 \le j \le n/\log^3 n - 1$, we keep a table. The table enables constant time retrieval of any range-minimum query within the subarray.

(2) Array $B = b_1, \ldots, b_{n/\log^3 n}$ consisting of the minimum in each subarray.

(3) A complete binary tree BT(2), whose leaves are $b_1, \ldots, b_{n/\log^3 n}$. Each internal node $v$ of the tree holds an array $P_v$ with an entry for each leaf of $v$. Consider prefixes that span between $l(v)$, the leftmost leaf of $v$, and a leaf of $v$. Array $P_v$ has the minima over all these prefixes. Node $v$ also holds a similar array $S_v$. For each suffix that spans between a leaf of $v$ and $r(v)$, the rightmost leaf of $v$, array $S_v$ has its minimum.

(4)  Two  arrays  of  size  n  each,  one  contains  all  prefix-minima  and  the  other  all  suffix-minima 
with  respect  to  A . 

Lemma 4.2.2. Let $m$ be 2. Then $I_m(n) = \log n$. The preprocessing algorithm will run in $2c$ time for some constant $c$ using $n\log n + \sqrt{k}\,n$ processors. The retrieval time of a query MIN[i,j] is $2c$.


How to retrieve a query MIN[i,j] in constant time?


There are two possibilities.

(i) $a_i$ and $a_j$ belong to the same subarray (of size $\log^3 n$). MIN[i,j] is computed in O(1) time using the table that belongs to the subarray.

(ii) $a_i$ and $a_j$ belong to different subarrays. We elaborate below on possibility (ii).

Let right(i) denote the rightmost element in the subarray of $a_i$ and left(j) denote the leftmost element in the subarray of $a_j$. MIN[i,j] is the minimum among three numbers.

1. MIN[i, right(i)], the minimum over the suffix of $a_i$ in its subarray.

2. MIN[left(j), j], the minimum over the prefix of $a_j$ in its subarray.

3. MIN[right(i)+1, left(j)-1].


The retrieval of the first and second numbers is similar to possibility (i) above. Denote $i_1 = \lceil i/\log^3 n \rceil + 1$ and $j_1 = \lceil j/\log^3 n \rceil - 1$. We discuss retrieval of the third number. This is equal to finding the minimum over interval $[b_{i_1}, \ldots, b_{j_1}]$ in $B$, which is denoted $\mathrm{MIN}_B[i_1, j_1]$.


Let $x$ be the lowest common ancestor of $b_{i_1}$ and $b_{j_1}$, $x_1$ be the child of $x$ that is an ancestor of $b_{i_1}$ and $x_2$ be the child of $x$ that is an ancestor of $b_{j_1}$.

$\mathrm{MIN}_B[i_1, j_1]$ is the minimum among two numbers:

1. $\mathrm{MIN}_B[i_1, r(x_1)]$, the minimum over the suffix of $b_{i_1}$ in $x_1$. We get this from $S_{x_1}$.

2. $\mathrm{MIN}_B[l(x_2), j_1]$, the minimum over the prefix of $b_{j_1}$ in $x_2$. We get this from $P_{x_2}$.

It remains to show how to find $x$, $x_1$ and $x_2$ in constant time. It was observed in [HT-84] that the lowest common ancestor of two leaves in a complete binary tree can be found in constant time using one processor. (The idea is to number the leaves from 0 to $n-1$. Given two leaves $i_1$ and $j_1$, it suffices to find the most significant bit in which the binary representations of $i_1$ and $j_1$ differ, in order to get their lowest common ancestor.) Thus $x$ (and thereby also $x_1$ and $x_2$) can be found in constant time. Constant time retrieval of a MIN[i,j] query follows.
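
The bit trick attributed to [HT-84] can be sketched as follows. The representation of a node as a (height, index) pair, with leaves at height 0 and nodes of each height numbered from the left, is our own convention for illustration.

```python
def lca_of_leaves(i, j):
    """Return (x, x1, x2) for two distinct leaves i and j of a complete binary tree.

    Leaves are numbered 0..n-1. A node is a pair (height, index); the ancestor of
    leaf t at height d is (d, t >> d). x is the lowest common ancestor, x1 the
    child of x above leaf i, and x2 the child of x above leaf j.
    """
    assert i != j
    d = (i ^ j).bit_length()  # 1 + position of the most significant differing bit
    x = (d, i >> d)
    x1 = (d - 1, i >> (d - 1))
    x2 = (d - 1, j >> (d - 1))
    return x, x1, x2
```

For example, leaves 1 and 6 first differ in bit 2, so their lowest common ancestor sits at height 3.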

The preprocessing algorithm for m = 2

Step 1. Partition $A$ into subarrays of $\log^3 n$ elements each. Allocate $\log^4 n$ processors to each subarray and apply the preprocessing algorithm for range-minima given in Section 3 (for $\epsilon = 1/3$). This uses $\log^4 n$ processors and $O(\log^3 n \log^{1/3} n)$ space per subarray and $n\log n$ processors and $o(n\log n)$ space overall.

Step 2. Take the minimum in each subarray to build array $B$ of size $n/\log^3 n$. The difference between two successive elements in $B$ is at most $k\log^3 n$.

Step 3. Build BT(2), a complete binary tree, whose leaves are the elements of $B$. For each internal node $v$ of BT(2) we keep an array. The array consists of the values of all leaves in the subtree rooted at $v$. So, the space needed is $n/\log^3 n$ per level and $n/\log^2 n$ for all levels of the tree. We allocate to each leaf at each level $\sqrt{k}\log^2 n$ processors and the total number of processors used is thus $\sqrt{k}\,n$.

Step 4. For each internal node, find the minimal element over its array. If the minimal element is not unique, the leftmost one is found. We apply the constant time algorithm mentioned in Remark 3.1. Consider an internal node of size $r$. After subtracting the first element of the array from each of its elements, we get an array whose elements range between $-kr\log^3 n$ and $kr\log^3 n$. The size of the range, which is $2kr\log^3 n + 1$, does not exceed the square of the number of processors, which is $r\sqrt{k}\log^2 n$, and the algorithm of Remark 3.1 can be applied.
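
The range condition used in Step 4 can be checked directly. The following worked inequality assumes $k \ge 1$, $r \ge 2$ (an internal node has at least two leaves) and $\log n \ge 2$; these side conditions are our assumptions, stated only to make the arithmetic explicit.

```latex
\[
  \bigl(r\sqrt{k}\,\log^2 n\bigr)^2
  \;=\; k\,r^2\log^4 n
  \;=\; (r\log n)\cdot k\,r\log^3 n
  \;\ge\; 4\,k\,r\log^3 n
  \;\ge\; 2\,k\,r\log^3 n + 1 .
\]
```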

Step 5. For each internal node $v$ we compute $P_v$ ($S_v$ is computed similarly). That is, for each leaf $b_i$ of $v$, we need to find $\mathrm{MIN}_B[l(v), i]$ (that is, the minimum over the prefix of $b_i$ in $v$). For this, the minimum among the following list of (at most) $\log n + 1$ numbers is computed. Denote the level of $v$ in the binary tree by level(v). Each level $l$, level(v) $< l \le \log n + 1$, of the tree contributes (at most) one number. Let $u$ denote the ancestor at level $l-1$ of $b_i$. Let $u_1$ and $u_2$ denote the left and right children of $u$ respectively. If $b_i$ belongs to (the subtree rooted at) $u_2$ then level $l$ contributes the minimum over $u_1$. If $b_i$ belongs to $u_1$ then level $l$ does not contribute anything (actually, level $l$ contributes a large default value so that the minimum computation is not affected). Finally, $b_i$ is also included in the list. This minimum computation can be done in constant time using $\log^2 n$ processors by the algorithm of [SV-81]. Note that all prefix-minima and all suffix-minima of $B$ are computed (in the root) in this step.

Step 6. For each $a_i$ we find its prefix minimum and its suffix minimum with respect to $A$ using one processor in constant time. Let $b_j$ be the minimum representing the subarray of size $\log^3 n$ containing $a_i$. The minimum over the prefix of $a_i$ with respect to $A$ is the minimum between the prefix minimum of $b_{j-1}$ with respect to $B$ and the minimum over the prefix of $a_i$ with respect to its subarray.

This  completes  the  description  of  the  inductive  base:  Items  (1),  (2),  (3)  and  (4)  of  the  output  were 
computed  (respectively)  in  steps  1,  2,  5  and  6  above. 

Complexity of the inductive base. O(1) time using $n\log n + \sqrt{k}\,n$ processors.

Lemma  4.2.2  follows. 
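
To make the output of the inductive base and its use in retrieval concrete, here is a sequential sketch under simplifying assumptions that are ours: the block size `s` stands in for $\log^3 n$, the number of blocks is a power of two, in-block queries are answered by a direct scan (a stand-in for the tables of Section 3), and the arrays $P_v$ and $S_v$ are filled by simple scans rather than by the constant-time parallel Steps 4 and 5. Indices are 0-based and every stored value is an index into `a`.

```python
def build_base(a, s):
    """Sequential stand-in for the m = 2 preprocessing; len(a) must be s * 2**h."""
    nb = len(a) // s
    assert nb * s == len(a) and nb & (nb - 1) == 0, "need s * (power of two) elements"
    key = lambda t: (a[t], t)  # smallest (value, index) = leftmost minimum
    b = [min(range(q * s, (q + 1) * s), key=key) for q in range(nb)]  # array B
    h = nb.bit_length() - 1
    P, S = [], []  # P[d][t], S[d][t]: prefix/suffix minimum of leaf t in its height-d ancestor
    for d in range(h + 1):
        width = 1 << d
        p, sfx = b[:], b[:]
        for t in range(nb):
            if t % width:  # t is not the leftmost leaf of its node
                p[t] = min(p[t - 1], b[t], key=key)
        for t in range(nb - 1, -1, -1):
            if (t + 1) % width:  # t is not the rightmost leaf of its node
                sfx[t] = min(sfx[t + 1], b[t], key=key)
        P.append(p)
        S.append(sfx)
    return {"a": a, "s": s, "P": P, "S": S, "key": key}


def query_base(ds, i, j):
    """Index of the leftmost minimum of a[i..j], combining at most four candidates."""
    s, key = ds["s"], ds["key"]
    scan = lambda lo, hi: min(range(lo, hi + 1), key=key)  # stand-in for the in-block table
    bi, bj = i // s, j // s
    if bi == bj:  # possibility (i)
        return scan(i, j)
    cands = [scan(i, (bi + 1) * s - 1), scan(bj * s, j)]  # MIN[i, right(i)], MIN[left(j), j]
    i1, j1 = bi + 1, bj - 1
    if i1 == j1:
        cands.append(ds["P"][0][i1])
    elif i1 < j1:
        d = (i1 ^ j1).bit_length()  # height of the lowest common ancestor x
        cands += [ds["S"][d - 1][i1], ds["P"][d - 1][j1]]  # from S_{x1} and P_{x2}
    return min(cands, key=key)
```

Because the nodes of one tree level partition the leaves, the sketch stores all $P_v$ arrays of a level in one list; this is only a storage convenience.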

The  inductive  step 

The  algorithm  is  presented  in  a  similar  way  to  the  inductive  base. 

Output  of  the  m'th  preprocessing  algorithm 

(1) For each consecutive subarray $a_{jI_m^3(n)+1}, \ldots, a_{(j+1)I_m^3(n)}$, $0 \le j \le n/I_m^3(n) - 1$, we keep a table. The table enables constant time retrieval of any range-minimum query within the subarray. (Comment. The notation $I_m^3(n)$ means $(I_m(n))^3$ where $I_m(n)$ is defined earlier.)

(2) Array $B = b_1, \ldots, b_{n/I_m^3(n)}$ consisting of the minimum in each subarray.

(3) TRL-BT(m), the top recursion level of (the recursive *-tree) BT(m), whose leaves are $b_1, \ldots, b_{n/I_m^3(n)}$. Each internal node $v$ of TRL-BT(m) holds an array $P_v$ and an array $S_v$ with an entry for each leaf of $v$. These arrays hold (as in the binary tree for m = 2) prefix-minima and suffix-minima with respect to the leaves of $v$.

(4) Let $u_1, \ldots, u_y$ be the children of $v$, an internal node of TRL-BT(m). Denote by $\mathrm{MIN}(u_i)$ the minimum over the leaves of node $u_i$, $1 \le i \le y$. Each such node $v$ has recursively the output of the (m-1)'th preprocessing algorithm with respect to the input $\mathrm{MIN}(u_1), \ldots, \mathrm{MIN}(u_y)$.

(5)  Two  arrays  of  size  n  each,  one  contains  all  prefix-minima  and  the  other  all  suffix-minima 
with  respect  to  A . 

How to retrieve a query MIN[i,j] in $cm$ time?

We distinguish two possibilities.

(i) $a_i$ and $a_j$ belong to the same subarray (of size $I_m^3(n)$). MIN[i,j] is computed in O(1) time using the table that belongs to the subarray.

(ii) $a_i$ and $a_j$ belong to different subarrays. We elaborate on possibility (ii).

Again, let right(i) denote the rightmost element in the subarray of $a_i$ and left(j) denote the leftmost element in the subarray of $a_j$. MIN[i,j] is the minimum among three numbers.

1. MIN[i, right(i)].

2. MIN[left(j), j].

3. MIN[right(i)+1, left(j)-1].


The retrieval of the first and second numbers is similar to possibility (i) above. Denote $i_1 = \lceil i/I_m^3(n) \rceil + 1$ and $j_1 = \lceil j/I_m^3(n) \rceil - 1$. Retrieval of the third number is equal to finding the minimum over interval $[b_{i_1}, \ldots, b_{j_1}]$ in $B$, which is denoted $\mathrm{MIN}_B[i_1, j_1]$.


Let $x$ be the lowest common ancestor of $b_{i_1}$ and $b_{j_1}$ in TRL-BT(m), $x_{\beta(i_1)}$ be the child of $x$ that is an ancestor of $b_{i_1}$ and $x_{\beta(j_1)}$ be the child of $x$ that is an ancestor of $b_{j_1}$. $\mathrm{MIN}_B[i_1, j_1]$ is (recursively) the minimum among three numbers.

1. $\mathrm{MIN}_B[i_1, r(x_{\beta(i_1)})]$, the minimum over the suffix of $b_{i_1}$ in $x_{\beta(i_1)}$. We get this from $S_{x_{\beta(i_1)}}$.

2. $\mathrm{MIN}_B[l(x_{\beta(j_1)}), j_1]$, the minimum over the prefix of $b_{j_1}$ in $x_{\beta(j_1)}$. We get this from $P_{x_{\beta(j_1)}}$.

3. $\mathrm{MIN}_B[r(x_{\beta(i_1)})+1, l(x_{\beta(j_1)})-1]$. This will be recursively derived from the data at node $x$.


The first two numbers are precomputed in TRL-BT(m). The recursive definition of the third number implies that $\mathrm{MIN}_B[i_1, j_1]$ is actually the minimum among $4(m-1)-2$ precomputed numbers. Therefore, in order to show that retrieval of MIN[i,j] takes time proportional to $m$, as claimed, it remains to explain how to find the nodes $x$, $x_{\beta(i_1)}$ and $x_{\beta(j_1)}$ in constant time using one processor. This is done below.


We first note that for each leaf of TRL-BT(m), finding the child of the root which is its ancestor needs constant time using one processor. Given two leaves of TRL-BT(m) consider their ancestors among the children of the root. If these ancestors are different, we are done. Suppose these ancestors are the same.

Each child of the root has $I_{m-1}(n/I_m^3(n)) \le I_{m-1}(n)$ leaves. Observe that for TRL-BT(m) the same subtree structure is replicated at each child of the root. For each pair of two leaves $u$ and $v$ of the generic subtree structure, we will compute three items into a table: (1) their lowest common ancestor $w$; (2) the child $f$ of $w$ which is an ancestor of $u$; (3) the child $g$ of $w$ which is an ancestor of $v$. The size of the table is only $O(I_{m-1}^2(n))$.

It remains to show how the table is computed. Consider an internal node $w$ of the tree and suppose that its rooted subtree has $r$ leaves. At node $w$ each pair of leaves $u$, $v$ is allocated to a processor. The processor determines in constant time if $w$ is the LCA of $u$ and $v$. This is done by finding whether the child of $w$ which is an ancestor of $u$, denoted $f$, and the child of $w$ which is an ancestor of $v$, denoted $g$, are different. If yes, then $w$, $f$ and $g$ are as required for the table. The
number  of  processors  needed  for  computing  the  table  is  O  (Jm-\  (n  ))• 

The  preprocessing  algorithm  for  m 

Inductively, we assume that we have an algorithm that preprocesses the array $A = (a_1, a_2, \ldots, a_n)$ for the range-minima problem in $c(m-1)$ time using $nI_{m-1}(n) + \sqrt{k}\,n$ processors, where $c$ is a constant; and that following this preprocessing any MIN[i,j] query can be answered in $c(m-1)$ time. We construct an algorithm that solves the range-minima problem in $c_1 + c(m-1)$ time for some constant $c_1$, using $nI_m(n) + \sqrt{k}\,n$ processors. We have already shown that a query can be answered in $c_2 m$ time for some constant $c_2$. Selecting initially $c > c_1$ and $c > c_2$ implies that the algorithm runs in $cm$ time using $nI_m(n) + \sqrt{k}\,n$ processors and that a query can be answered in $cm$ time.

Step 1. Partition $A$ into subarrays of $I_m^3(n)$ elements each. Allocate $I_m^4(n)$ processors to each subarray and apply the preprocessing algorithm for range-minima given in Section 3 (for $\epsilon = 1/3$). This uses $I_m^4(n)$ processors and $O(I_m^3(n) I_m^{1/3}(n))$ space per subarray and $nI_m(n)$ processors and $o(nI_m(n))$ space overall.

Step 2. Take the minimum in each subarray to build array $B$ of size $n/I_m^3(n)$. The difference between two successive elements in $B$ is at most $kI_m^3(n)$.

Step 3. Build TRL-BT(m), the upper level of a BT(m) tree whose leaves are the elements of $B$. Each internal node of TRL-BT(m), whose rooted tree has $r$ leaves, has $r/I_{m-1}(r)$ children. For each such internal node $v$ of TRL-BT(m) we keep an array. The array consists of the values of the $r$ leaves of the subtree rooted at $v$. TRL-BT(m) will have $I_{m-1}^*(n/I_m^3(n)) + 1 \le I_m(n) + 1$ levels. So, the space needed is $n/I_m^3(n)$ per level and $O(n/I_m^2(n))$ for all levels of TRL-BT(m). We allocate to each leaf at each level $1 + \sqrt{k}\,I_m^2(n)$ processors and the total number of processors used is thus $n/I_m^2(n) + \sqrt{k}\,n$ (which is less than $nI_m(n) + \sqrt{k}\,n$).

Step 4.1. For each internal node of TRL-BT(m), find the minimum over its array. The difference between the minimum value and the maximum value in an array never exceeds the square of its number of processors and we apply the constant time algorithm mentioned in Remark 3.1 as in Step 4 of the inductive base algorithm.

Step 4.2. We focus on internal node $v$ having $r$ leaves in TRL-BT(m). Each of its $r/I_{m-1}(r)$ children contributes its minimum and we preprocess these minima using the assumed algorithm for $m-1$. The difference between adjacent elements is at most $kI_{m-1}(r)I_m^3(n)$. Thus, this computation takes $c(m-1)$ time using $r + \sqrt{kI_m^3(n)}\,r$ processors. (To see this, simplify $\frac{r}{I_{m-1}(r)} I_{m-1}(r) + \sqrt{kI_{m-1}(r)I_m^3(n)}\,\frac{r}{I_{m-1}(r)}$, the processor count term for this problem, into $r + \sqrt{kI_m^3(n)}\,r/\sqrt{I_{m-1}(r)}$, which is less than $r + \sqrt{kI_m^3(n)}\,r$ processors.) This amounts to $n/I_m^3(n) + \sqrt{kI_m^3(n)}\,n/I_m^3(n)$ per level or a total of $n/I_m^2(n) + \sqrt{k}\,n/\sqrt{I_m(n)}$ processors, which is less than $nI_m(n) + \sqrt{k}\,n$.

Step 5. For each internal node $v$ we compute $P_v$ ($S_v$ is computed similarly). That is, for each leaf $b_i$ of $v$ we need to find $\mathrm{MIN}_B[l(v), i]$, the minimum over the prefix of $b_i$ with respect to the leaves of node $v$. For this, the minimum among the following list of at most $I_{m-1}^*(n) + 1 = I_m(n) + 1$ numbers is computed: Each level $l$, level(v) $< l \le I_m(n) + 1$, of the tree contributes (at most) one number. Let $u$ denote the ancestor at level $l-1$ of $b_i$ and let $u_1, \ldots, u_y$ denote its children, which are at level $l$. Suppose $u_j$, $j > 1$, is an ancestor of $b_i$. We take the prefix-minimum over the leaves of $u_1, \ldots, u_{j-1}$. This prefix-minimum is computed in the previous step (by the assumed algorithm for $m-1$). If $u_1$ is the ancestor of $b_i$ then level $l$ contributes a large default value (as in Step 5 of the inductive base algorithm). Finally, $b_i$ is also added to the list. This minimum computation can be done in constant time using $I_m^2(n)$ processors (by the algorithm of [SV-81]). Note that all prefix-minima and all suffix-minima with respect to $B$ are computed (in the root) in this step.

Step 6. For each $a_i$ we find its prefix minimum and its suffix minimum with respect to $A$ using one processor in constant time. This is similar to Step 6 of the inductive base.

This completes the description of the inductive step: Items (1), (2), (3), (4) and (5) of the output were computed (respectively) in steps 1, 2, 5, 4.2 and 6 above.

Complexity  of  the  inductive  step 

In addition to application of the inductively assumed algorithm, steps 1 through 6 take constant time using $nI_m(n) + \sqrt{k}\,n$ processors. This totals $cm$ time using $nI_m(n) + \sqrt{k}\,n$ processors.

Together  with  Lemma  4.2.2,  Lemma  4.2.1  follows. 

From  recursion  to  algorithm 

The recursive procedure in Lemma 4.2.1 translates easily into a constructive parallel algorithm where the instructions for each processor at each time unit are available. For such a translation, issues such as processor allocation and computation of certain functions need to be taken into account. Since TRL-BT(m) is balanced, allocating processors in the algorithm above can be done in constant time if the following functions are precomputed: (a) $I_m(x)$ for $1 \le x \le n$ and (b) $I_{m-1}^{(i)}(x)$ for $1 \le x \le n$ and $1 \le i \le I_m(x)$. These same functions suffice for all other computations above. The functions will be computed and stored in a table at the beginning of the algorithm. The last section discusses their computation.
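
The two tables can be illustrated by a sequential sketch. It relies on two facts used in this section, $I_2(n) = \log n$ and $I_m(n) = I_{m-1}^*(n)$; the rounding of the logarithm to $\lceil \log_2 x \rceil$ and the reading of the * functional as "number of applications needed to reach a value of at most 1" are assumptions on our part, as are the function names.

```python
def star(f):
    """f*(x): how many times f must be applied to x before the value is <= 1."""
    def f_star(x):
        count = 0
        while x > 1:
            x = f(x)
            count += 1
        return count
    return f_star


def make_I(m):
    """Return I_m for m >= 2, assuming I_2(x) = ceil(log2 x) and I_m = (I_{m-1})*."""
    f = lambda x: (x - 1).bit_length()  # exact ceil(log2 x) for integers x >= 1
    for _ in range(m - 2):
        f = star(f)
    return f


def tables(m, n):
    """Table (a): I_m(x); table (b): the i-th iterate of I_{m-1} on x, 1 <= i <= I_m(x)."""
    assert m >= 3
    I_m, I_prev = make_I(m), make_I(m - 1)
    table_a = {x: I_m(x) for x in range(1, n + 1)}
    table_b = {}
    for x in range(1, n + 1):
        y = x
        for i in range(1, table_a[x] + 1):
            y = I_prev(y)
            table_b[(x, i)] = y
    return table_a, table_b
```

With these assumptions `make_I(3)` behaves like $\log^*$: starting from 16 the iterates of $\lceil \log_2 \rceil$ are 4, 2, 1, so the value is 3.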

4.3. The optimal parallel algorithms

In this subsection, we show how to derive a series of optimal parallel algorithms from the series of algorithms described in Lemma 4.2.1. Theorem 4.3.1 gives a general trade-off result between running time of the preprocessing algorithm and retrieval time. Corollary 4.3.1 emphasizes results where the retrieval time for a query is constant. Corollary 4.3.2 points at an interesting trade-off instance where the retrieval time bound is increased to $O(\alpha(n))$ and the preprocessing algorithm runs in $O(\alpha(n))$ time (i.e., it becomes almost fully-parallel).

Theorem 4.3.1. Consider the range-minima problem, where $k$, the bound on the difference between two successive elements in $A$, is constant. For each $2 \le m \le \alpha(n)$, we present a parallel preprocessing algorithm whose running time is $O(I_m(n))$ using an optimal number of processors. The retrieval time of a query is $O(m)$.

Corollary 4.3.1. When $m$ is constant the preprocessing algorithm runs in $O(I_m(n))$ time using $n/I_m(n)$ processors. Retrieval time is O(1).

Corollary 4.3.2. When $m = \alpha(n)$ the preprocessing algorithm runs in $O(\alpha(n))$ time using $n/\alpha(n)$ processors. Retrieval time is $O(\alpha(n))$.

We  describe  below  the  optimal  preprocessing  algorithm  for  m  as  per  Theorem  4.3.1. 

Step 1. Partition $A$ into subarrays of $I_m^2(n)$ elements each, allocate $I_m(n)$ processors to each subarray and find the minimum in the subarray. This can be done in $O(I_m(n))$ time.

Put the $n/I_m^2(n)$ minima into an array $B$.

Step 2. Out of the series of preprocessing algorithms of Lemma 4.2.1 apply the algorithm for $m$ to $B$, where $k'$, the difference between two successive elements of $B$, is $O(I_m^2(n))$. This will take $O(m)$ time using $\sqrt{k'}\,n/I_m^2(n) + n/I_m(n)$ processors. This can be simulated in $O(m)$ time using $n/I_m(n)$ processors.

Step 3. Preprocess each subarray of $I_m^2(n)$ elements so that a range-minimum query within the subarray can be retrieved in O(1) time. This is done using the following parallel variant of the range-minima algorithm of [GBT-84].

Range-minimum:  a  parallel  variant  of  GBT’s  algorithm 

Consider the general range-minima problem as defined in Section 3, with respect to an input array $C = (c_1, c_2, \ldots, c_n)$. We overview a preprocessing algorithm that runs in $O(\sqrt{n})$ time using $\sqrt{n}$ processors, so that a range-minimum query can be processed in constant time.

(1) Partition array $C$ into $\sqrt{n}$ subarrays $C_1, \ldots, C_{\sqrt{n}}$ each with $\sqrt{n}$ elements.

(2) Apply the linear time serial algorithm of [GBT-84] separately to each $C_i$, taking $O(\sqrt{n})$ time using $\sqrt{n}$ processors.

(3) Let $\bar{c}_i$ be the minimum over $C_i$. Apply GBT's algorithm to $\bar{C} = (\bar{c}_1, \ldots, \bar{c}_{\sqrt{n}})$ in $O(\sqrt{n})$ time using a single processor.

It  should  be  clear  that  any  range-minimum  query  with  respect  to  C  can  be  retrieved  in  constant 
time  by  at  most  three  queries  with  respect  to  the  tables  built  by  the  above  applications  of  GBT’s 
algorithm. 
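
The three-query retrieval can be sketched sequentially as follows. Note that this sketch does not implement GBT's linear-time algorithm: it swaps in a naive all-pairs table as a stand-in, which preserves the constant-time lookups but not the preprocessing bounds. It also assumes that $n$ is a perfect square; all names are ours.

```python
from math import isqrt


def all_pairs_table(c):
    """Naive stand-in for GBT preprocessing: table[i][j] = leftmost min index of c[i..j]."""
    n = len(c)
    table = [[0] * n for _ in range(n)]
    for i in range(n):
        best = i
        for j in range(i, n):
            if c[j] < c[best]:
                best = j
            table[i][j] = best
    return table


def preprocess_sqrt(c):
    s = isqrt(len(c))
    assert s * s == len(c), "sketch assumes len(c) is a perfect square"
    block_tables = [all_pairs_table(c[q * s:(q + 1) * s]) for q in range(s)]
    # Position in c of the leftmost minimum of each block (the array C-bar, by index).
    block_min = [q * s + block_tables[q][0][s - 1] for q in range(s)]
    top_table = all_pairs_table([c[t] for t in block_min])
    return s, block_tables, block_min, top_table


def query_sqrt(c, ds, i, j):
    """Index of the leftmost minimum of c[i..j] using at most three table lookups."""
    s, block_tables, block_min, top_table = ds
    bi, bj = i // s, j // s
    if bi == bj:
        return bi * s + block_tables[bi][i - bi * s][j - bi * s]
    cands = [bi * s + block_tables[bi][i - bi * s][s - 1],  # suffix of i's block
             bj * s + block_tables[bj][0][j - bj * s]]      # prefix of j's block
    if bi + 1 <= bj - 1:                                    # whole blocks in between
        cands.append(block_min[top_table[bi + 1][bj - 1]])
    return min(cands, key=lambda t: (c[t], t))
```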

Complexity of the preprocessing algorithm. $O(I_m(n))$ time using $n/I_m(n)$ processors. Retrieval of a range-minimum query will take $O(m+1)$ time, which is $O(m)$ time.

Theorem  4.3.1  follows. 

4.4.  The  fully-parallel  algorithms 

Consider the restricted-domain range-minima problem where $k$, the bound on the difference between adjacent elements, is constant. In this subsection, we present a fully-parallel preprocessing algorithm for the problem on a CRCW-bit PRAM that provides for constant time processing of a query. Theorem 4.4.1 gives the general result being achieved in this subsection including trade-offs among parameters. Corollary 4.4.1 summarizes the fully-parallel result.

Let $d$ be an integer, $2 \le d \le \alpha(n)$. The model of parallel computation is the CRCW-bit PRAM with the assumption that up to $I_d(n)$ processors may write simultaneously into different bits of the same memory word.

Theorem 4.4.1. The preprocessing algorithm takes $O(d)$ time using $n$ processors. The retrieval time for a query MIN[i,j] is $O(d)$.

Remark. Theorem 4.4.1 represents a trade-off between the time for the preprocessing algorithm and query retrieval on one hand and the number of processors that may write simultaneously into different bits of the same memory word on the other hand.

Corollary 4.4.1. For a constant $d$, the algorithm is fully-parallel and query retrieval time is constant.

Step 1. Partition $A$ into $n/I_d(n)$ subarrays of $I_d(n)$ elements each. For each subarray, find the minimum in O(1) time using $I_d(n)$ processors. For this we apply the constant time algorithm mentioned in Remark 3.1 as in Step 4 of the inductive base algorithm.

Put the $n/I_d(n)$ minima into an array $B$. The difference between two successive elements in $B$ is at most $kI_d(n)$.

Step 2. Out of the series of preprocessing algorithms of Lemma 4.2.1 apply the algorithm for $d$ to $B$, where $k'$, the difference between two successive elements of $B$, is $O(I_d(n))$. This will take $O(d)$ time and $\sqrt{k'}\,n/I_d(n) + n$ processors and can be simulated in $O(d)$ time using $n$ processors.

Suppose we know how to retrieve a range-minimum query within each of the subarrays of size $I_d(n)$ in constant time. It should be clear how a query MIN[i,j] can then be retrieved in $O(d)$ time. Theorem 4.4.1 would follow.

Thus, it remains to show how to preprocess the subarrays of size $I_d(n)$ in constant time such that a range-minimum query within a subarray can be retrieved in constant time. The preprocessing of these subarrays is done in steps 3.1, 3.2 and 3.3.

Step  3.1.  For  each  subarray,  subtract  the  value  of  its  first  element  from  each  element  of  the 
subarray. 

Observe that following this subtraction the value of the first element is zero and the difference between each pair of successive elements remains at most $k$. Step 3.2 constructs a table with the following information: For any $I_d(n)$-tuple $(c_1, \ldots, c_{I_d(n)})$, where $c_1 = 0$ and the difference between each pair of successive $c_i$ values is at most $k$, the table has an entry. This entry gives all $I_d(n)(I_d(n)-1)/2$ range-minima with respect to this $I_d(n)$-tuple.

Step 3.2. All $n$ processors together build a table. Each entry of the table corresponds to one possible allocation of values to the $I_d(n)$-tuple. The entry will provide all $I_d(n)(I_d(n)-1)/2$ range-minima for this allocation.

Observe that the number of possible allocations is $(2k+1)^{I_d(n)-1}$. To see this, we note that each possible allocation can be characterized by a sequence of $I_d(n)-1$ numbers taken from $[-k, k]$. This will indeed be the number of entries in our table. Using $n$ processors (or even fewer) the table can be built in O(1) time.

Step 3.3. The only difficulty is to identify the table entry for our $I_d(n)$-tuple since once we reach the entry, the table will already provide the desired range-minima. We allocate to each subarray $I_d(n)$ processors. For each subarray, we have a word in our shared memory with $(I_d(n)-1)\log(2k+1)$ bits. Processor $i$, $1 < i \le I_d(n)$, will write $c_i - c_{i-1}$ (which is a number from $[-k, k]$) starting in bit number $(i-2)\log(2k+1)$ of the word belonging to its subarray (bit zero being the least significant). As a result this word will have a sequence of numbers from $[-k, k]$ that yields the desired entry in our table. Note that exactly $I_d(n)$ processors write to different bits of the same memory word.
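
Steps 3.1 to 3.3 can be sketched sequentially as follows. The sketch makes two assumptions that the text does not spell out: each difference $d \in [-k, k]$ is stored as the non-negative value $d + k$, and each field occupies $\lceil \log_2(2k+1) \rceil$ bits. The simultaneous bit-writes of the CRCW-bit PRAM are replaced by an ordinary loop, indices are 0-based, and the function names are ours.

```python
from itertools import product


def bits_per_diff(k):
    """Field width: enough bits to hold the shifted differences 0..2k."""
    return (2 * k).bit_length()


def pack(sub, k):
    """The word that the processors of one subarray would jointly write."""
    w = bits_per_diff(k)
    word = 0
    for i in range(1, len(sub)):
        d = sub[i] - sub[i - 1]
        assert -k <= d <= k
        word |= (d + k) << ((i - 1) * w)  # field i-1 holds the i-th difference
    return word


def build_table(length, k):
    """One entry per difference sequence; each entry lists all leftmost range-minima."""
    table = {}
    for diffs in product(range(-k, k + 1), repeat=length - 1):
        c = [0]  # first element is zero, as after Step 3.1
        for d in diffs:
            c.append(c[-1] + d)
        answers = {}
        for i in range(length):
            best = i
            for j in range(i, length):
                if c[j] < c[best]:
                    best = j
                answers[(i, j)] = best  # includes the trivial pairs i == j
        table[pack(c, k)] = answers
    return table


def lookup(table, sub, k, i, j):
    """Position within sub of the leftmost minimum of sub[i..j]."""
    return table[pack(sub, k)][(i, j)]
```

Since `pack` depends only on successive differences, the subtraction of Step 3.1 is implicit: a subarray and its shifted copy map to the same word.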

Theorem  4.4.1  follows. 

4.5.  The  all  nearest  zero  bit  problem 

The  following  corollary  of  theorems  4.3.1  and  4.4.1  is  needed  for  Section  5. 

Corollary 4.5.1. The all nearest zero bit problem is almost fully-parallel. On the CRCW-bit PRAM the all nearest zero bit problem is fully-parallel.




Proof. Recall that the algorithm for the restricted-domain range-minima problem computes all suffix-minima. Recall also that in case the minimum over an interval is not unique, the leftmost minimum is found. Thus if we apply the restricted-domain range-minima algorithm (with difference between successive elements at most one) with respect to $A$ then the minimum over the suffix of entry $i+1$ gives the nearest zero to the right of entry $i$. Thus, the all nearest zero bit problem is actually an instance of the restricted-domain range-minima problem (with difference between successive elements at most one). It follows that the almost fully-parallel and the fully-parallel algorithms for the latter apply for the all nearest zero bit problem as well.
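
The argument of the proof can be illustrated sequentially: the leftmost minimum of the suffix starting at entry $i+1$ is the nearest zero to the right of entry $i$, whenever that suffix contains a zero. The right-to-left scan below is only a stand-in for the parallel suffix-minima computation; returning `None` when no zero exists is our convention.

```python
def nearest_zero_to_right(d):
    """For each position i of a 0/1 list, the index of the nearest zero right of i."""
    n = len(d)
    # suffix_min[i] = index of the leftmost minimum of d[i:], or None if d[i:] is empty
    suffix_min = [None] * (n + 1)
    for i in range(n - 1, -1, -1):
        nxt = suffix_min[i + 1]
        suffix_min[i] = i if nxt is None or d[i] <= d[nxt] else nxt
    result = []
    for i in range(n):
        t = suffix_min[i + 1]
        result.append(t if t is not None and d[t] == 0 else None)
    return result
```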

5.  Almost  fully-parallel  reducibility 

We demonstrate how to use the *-tree data structure for reducing a problem A into another problem B by an almost fully-parallel algorithm. We apply this reduction for deriving a parallel lower bound for problem A from a known parallel lower bound for problem B.

Given  a  convex  polygon  with  n  vertices,  the  all  nearest  neighbors  (ANN)  problem  is  to  find 
for  each  vertex  of  the  polygon  its  nearest  (Euclidean)  neighbor. 

Theorem 5.1. Any CRCW PRAM algorithm for the ANN problem that uses $O(n\log^c n)$ (for any constant $c$) processors needs $\Omega(\log\log n)$ time.

Proof.  We  give  below  an  almost  fully-parallel  reduction  from  the  problem  of  merging  two  sorted 
lists  of  length  n  each  to  the  ANN  problem  with  0(n)  vertices.  This  reduction  together  with  the 
following  lemma  imply  the  Theorem. 

Lemma. Merging two sorted lists of length $n$ each using $O(n\log^c n)$ (for any constant $c$) processors on a CRCW PRAM needs $\Omega(\log\log n)$ time.

A  remark  in  [ScV-88a]  implies  that  Borodin  and  Hopcroft’s  ([BHo-85])  lower  bound  for  merging 
in  a  parallel  comparisons  model  can  be  extended  to  yield  the  Lemma. 

Proof  of  Theorem  5.1  (continued). 

The  reduction  (see  Figure  5.1): 

Let $A = (a_1, a_2, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_n)$ be two increasing lists of numbers that we wish to
merge. Assume, without loss of generality, that the numbers are integers and that $a_1 = b_1$,
$a_n = b_n$. (The lower bound for merging assumes that the numbers are integers.) Consider the
following auxiliary problem: For each $1 \le i \le n$, find the minimum index $j$ such that $b_j > a_i$. The
position of $a_i$ in the merged list is $i+j-1$ and therefore an algorithm for the auxiliary problem
(together with a similar algorithm for finding the positions of the $b_i$ numbers in the merged list)
suffices for the merging problem.
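
As a minimal sequential sketch (not the parallel algorithm), the positions in the merged list can be computed from the auxiliary indices as follows. The function name `merge_by_ranks` is hypothetical. Two assumptions are made here: the strict inequality is used, so equal elements of B precede those of A, and the "similar algorithm" for B is taken to count the elements of A strictly smaller than each $b_j$.

```python
from bisect import bisect_left, bisect_right

def merge_by_ranks(a, b):
    """Merge two increasing lists by computing each element's final position."""
    merged = [None] * (len(a) + len(b))
    for i, ai in enumerate(a, start=1):
        # j = minimum 1-based index with b_j > a_i (len(b) + 1 if none exists)
        j = bisect_right(b, ai) + 1
        merged[(i + j - 1) - 1] = ai  # position i + j - 1, stored 0-based
    for j, bj in enumerate(b, start=1):
        # assumed symmetric rule: count the elements of a strictly below b_j
        smaller = bisect_left(a, bj)
        merged[(j + smaller) - 1] = bj  # position j + smaller, stored 0-based
    return merged
```

Each position is computed independently of the others, which is what makes the auxiliary problem a suitable target for a parallel reduction.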

We  give  an  almost  fully-parallel  reduction  from  the  auxiliary  problem  to  the  ANN  problem 
with  respect  to  the  following  convex  polygon. 
Let $(c_1, c_2, \ldots, c_{2n-1}) = (a_1, \frac{a_1+a_2}{2}, a_2, \frac{a_2+a_3}{2}, \ldots, a_{n-1}, \frac{a_{n-1}+a_n}{2}, a_n)$. The
numbers $c_1, c_2, \ldots, c_{2n-1}$ form an increasing list. The convex polygon is
$(c_1, 0), (c_2, 0), \ldots, (c_{2n-1}, 0), (b_n, 1/4), (b_{n-1}, 1/4), \ldots, (b_1, 1/4)$.
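
The construction of the polygon can be sketched as follows. This is only an illustration: the function name `build_polygon` is hypothetical, and exact rational arithmetic is an implementation choice made here to keep the midpoints exact.

```python
from fractions import Fraction

def build_polygon(a, b):
    """Return the vertex list of the polygon built from increasing lists a and b."""
    # c interleaves the a_i with the midpoints of consecutive pairs
    c = []
    for left, right in zip(a, a[1:]):
        c.append(Fraction(left))
        c.append(Fraction(left + right, 2))
    c.append(Fraction(a[-1]))
    # bottom chain: (c_1, 0), ..., (c_{2n-1}, 0)
    bottom = [(x, Fraction(0)) for x in c]
    # top chain in reverse order: (b_n, 1/4), ..., (b_1, 1/4)
    top = [(Fraction(x), Fraction(1, 4)) for x in reversed(b)]
    return bottom + top
```

For lists of length $n$ the polygon has $3n-1$ vertices, which agrees with the $O(n)$ bound stated in the proof.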


In [ScV-88a] a similar construction is given and the lower bound proof then follows by
(nontrivial) Ramsey theoretic arguments.


Let $D[1, \ldots, 2n-1]$ be a binary vector. Each vertex $(c_l, 0)$ finds its nearest neighbor with
respect to the convex polygon (using a 'supposedly existing' algorithm for the ANN problem) and
assigns the following into vector $D$.

If the nearest vertex is of the form $(b_k, 1/4)$
then $D(l) := 0$
else $D(l) := 1$
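
The assignment into $D$ can be illustrated by the sketch below, in which a brute-force search stands in for the 'supposedly existing' ANN algorithm. The function name `classify_vertices` is hypothetical, and the tie rule (a tie between a top vertex and a bottom vertex yields 1) is an assumption, since the text does not specify one.

```python
from fractions import Fraction

def classify_vertices(c, b):
    """Return D (0-based): D[l] = 0 if the nearest neighbor of (c_l, 0) is some (b_k, 1/4)."""
    h2 = Fraction(1, 16)  # squared height of the top vertices
    D = []
    for l, cl in enumerate(c):
        # squared distance to the closest other vertex on the bottom chain
        d_bottom = min((cl - x) ** 2 for k, x in enumerate(c) if k != l)
        # squared distance to the closest vertex on the top chain
        d_top = min((cl - x) ** 2 + h2 for x in b)
        D.append(0 if d_top < d_bottom else 1)
    return D
```

The sketch assumes that `c` holds at least two values.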

Next we apply to vector $D$ the almost fully-parallel algorithm for the nearest zero bit problem of
Subsection 4.5. Finally, we show how to solve our auxiliary problem with respect to every
element $a_i$. We break into two cases concerning the nearest neighbor of $(a_i, 0)$ ($= (c_{2i-1}, 0)$).

Case (i). The nearest neighbor of $(a_i, 0)$ is a vertex $(b_\alpha, 1/4)$. Then, the minimum index $j$
such that $b_j > a_i$ is either $\alpha$ or $\alpha+1$. A single processor can determine the correct value of
$j$ in $O(1)$ time.


Case (ii). Otherwise. Then, $D(2i-1) = 1$. The nearest zero computation gives the smallest
index $k > 2i-1$ for which $D(k) = 0$. Let the nearest neighbor of $(c_k, 0)$ be $(b_\alpha, 1/4)$. Then
$j = \alpha$ is the minimum index for which $b_j > a_i$.
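
The nearest zero lookup used in Case (ii) can be sketched sequentially as below. This is a plain right-to-left scan, not the almost fully-parallel algorithm of Subsection 4.5; the function name `next_zero_to_right` and the 0-based indexing are choices made for this sketch.

```python
def next_zero_to_right(D):
    """For each 0-based position l, return the smallest k > l with D[k] == 0, or None."""
    result = [None] * len(D)
    nearest = None  # closest zero seen so far strictly to the right
    for l in range(len(D) - 1, -1, -1):
        result[l] = nearest
        if D[l] == 0:
            nearest = l
    return result
```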


6.  Computing  various  functions 

We need to compute certain functions during our algorithms. For each $2 \le m \le \alpha(n)$ we need
$I_{m-1}^{(i)}(n)$ for all $1 \le i \le I_m(n)$. Fortunately, the function parameters that we are actually concerned
with are small, relative to $n$. For instance, in order to compute $I_3(n) = \log^* n$, it is enough to
compute $\log^*(\log\log n)$ since $\log^* n = \log^*(\log\log n) + 2$.
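
The identity can be checked numerically with the short sketch below. It assumes a real-valued definition of $\log^*$ (the number of applications of $\log_2$ needed to bring the argument to at most 1), which may differ in rounding from the definition used elsewhere in the paper; the function names are hypothetical.

```python
import math

def log_star(x):
    """Number of times log2 must be applied to x before the value drops to at most 1."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

def log_star_via_loglog(n):
    """Evaluate log* n through the much smaller argument loglog n (requires n > 2)."""
    return log_star(math.log2(math.log2(n))) + 2
```

The point of the identity is that the argument actually processed, $\log\log n$, is tiny compared to $n$.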

We show only how to compute $I_m(n)$ for each $2 \le m \le \alpha(n)$. Computation of $I_{m-1}^{(i)}(n)$ for
$2 \le m \le \alpha(n)$ and all $1 \le i \le I_m(n)$ is similar. Our computation works by induction on $m$.

The inductive hypothesis: Let $x = \log\log n$. We know how to compute the following values in $O(m-1)$
time using $o(n/\alpha(n))$ processors: (1) $I_{m-1}(n)$; and (2) $I_{m-1}(y)$ for all $1 \le y \le \log\log n$.

We show the inductive claim (the claim itself should be clear), assuming the inductive
hypothesis. The inductive base is given later. First, we describe (informally) the computation of
$I_m(x)$ in $O(1)$ (additional) time using $o(n/\alpha(n))$ processors. Consider all permutations of the
numbers $1, \ldots, x$. The number of these permutations is (much) less than $n$. The idea is to
identify a permutation $(p_0, p_1, \ldots)$ that provides the sequence
$[x, I_{m-1}(x), I_{m-1}^{(2)}(x), \ldots, I_{m-1}^{(k)}(x) = 1, \ldots]$. So if $I_{m-1}(p_i) = p_{i+1}$ for all $0 \le i \le k-1$ we conclude $I_m(x) = k$. We
can check this condition in $O(1)$ time using $x$ processors per permutation, using the ability of the
CRCW PRAM to find the AND of $x$ bits in $O(1)$ time. The total number of processors is
$o(n/\alpha(n))$. We make two remarks: (1) Computing $I_m(y)$ for all $1 \le y \le \log\log n$, the rest of the
inductive claim, in $O(1)$ time using $o(n/\alpha(n))$ processors is similar. (2) There are easy ways for
finding all permutations in $O(1)$ time using the number of available processors.
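
The permutation idea can be illustrated sequentially as follows. This is a sketch under assumptions: the integer function `ceil_log2` is only a stand-in for $I_{m-1}$, both function names are hypothetical, and the loop over permutations replaces the step in which all permutations are checked simultaneously. A permutation $p$ of $1, \ldots, x$ with $p_0 = x$ is accepted when every link $f(p_i) = p_{i+1}$ holds up to the position $k$ where $p_k = 1$; the `all(...)` call plays the role of the AND of $x$ bits.

```python
from itertools import permutations

def ceil_log2(y):
    """Ceiling of log2(y) for an integer y >= 1 (stand-in for I_{m-1})."""
    return (y - 1).bit_length()

def star_by_permutations(f, x):
    """Return k such that f applied k times to x gives 1, found by checking permutations."""
    for p in permutations(range(1, x + 1)):
        if p[0] != x:
            continue
        k = p.index(1)  # position at which the chain should reach 1
        # AND of the link conditions f(p_i) = p_{i+1} for 0 <= i <= k-1
        if all(f(p[i]) == p[i + 1] for i in range(k)):
            return k
    return None
```

For example, with `ceil_log2` and $x = 5$ the accepted chain is 5, 3, 2, 1, so the result is 3. The search succeeds whenever $f(y) < y$ for all $y > 1$, since the chain then consists of distinct values and extends to a permutation.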

We finish by showing the inductive base. We compute $\log n$ in $O(1)$ time and $o(n/\alpha(n))$
processors as follows. If $n$ is given in a binary representation then the index of the leftmost one
is $\log n$. Following [FRW-84], this can be computed in $O(1)$ time using as many processors as the
number of bits of a number. By iterating this we get $\log^{(2)} n$. Finally, we find $\log y$ for all
$1 \le y \le \log\log n$. The number of processors used for this computation is $o(n/\alpha(n))$.
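
The inductive base can be sketched as below. The sketch is sequential and takes $\log$ to mean the floor of $\log_2$, which is an assumption about rounding; the function names are hypothetical.

```python
def floor_log2(n):
    """Index of the leftmost one bit of n (bit 0 is the least significant), n >= 1."""
    return n.bit_length() - 1

def base_values(n):
    """Return (log n, loglog n, [log y for y = 1..loglog n]) using bit positions only."""
    log_n = floor_log2(n)          # leftmost one of n
    loglog_n = floor_log2(log_n)   # iterate once more on the result
    return log_n, loglog_n, [floor_log2(y) for y in range(1, loglog_n + 1)]
```

The sketch requires $n \ge 2$, so that $\log n \ge 1$ and the second step is defined.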